In-Memory Analytic DBMS

History

“As DRDB [Disk-Resident DB] perform more and more in-memory optimizations, they become closer to MMDB [Main-Memory DB]. In the future, we expect that the differences between a MMDB and DRDB will disappear: any good database management system will recognize and exploit the fact that some data will reside permanently in memory and should be managed accordingly”

But what about:

Some Early MMDB Systems

Mostly focused on transaction processing (the harder part?)

We’ll read a more modern example, Silo, next lecture.

Background/Context

“Volcano-style” iterators
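As a reminder of what the iterator model looks like, here is a minimal sketch assuming the usual open/next/close interface. The class names (`Scan`, `Filter`) and the `run` helper are hypothetical and chosen only for illustration. Each operator pulls one tuple at a time from its child.

```python
from typing import Callable, Iterable, Optional

Row = tuple


class Scan:
    """Leaf operator (hypothetical name): hands out rows of an in-memory table."""

    def __init__(self, table: Iterable[Row]):
        self.table = table
        self.it = None

    def open(self) -> None:
        self.it = iter(self.table)

    def next(self) -> Optional[Row]:
        # One row per call; None signals end of stream.
        return next(self.it, None)

    def close(self) -> None:
        self.it = None


class Filter:
    """Pulls rows from its child until one satisfies the predicate."""

    def __init__(self, child, pred: Callable[[Row], bool]):
        self.child = child
        self.pred = pred

    def open(self) -> None:
        self.child.open()

    def next(self) -> Optional[Row]:
        while (row := self.child.next()) is not None:
            if self.pred(row):
                return row
        return None

    def close(self) -> None:
        self.child.close()


def run(root) -> list:
    """Drive the plan from the root: the consumer pulls until exhausted."""
    root.open()
    out = []
    while (row := root.next()) is not None:
        out.append(row)
    root.close()
    return out
```

Note the cost this design implies: one `next()` call per tuple per operator. That per-tuple overhead is what the systems discussed later attack.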

Columnar Data Representation

Improves analytic performance in both disk-based systems and MMDBs. Originally designed for disk-based systems, but important background for today.

Super simple idea: column-major rather than row-major layout.
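To make the layout difference concrete, here is a small sketch. The table contents, the column names (`id`, `price`, `tag`) and the helper names are all made up for illustration. The point is that an aggregate over one attribute touches a single contiguous array in the columnar layout, rather than striding through whole rows.

```python
from array import array

# Row-major: one tuple per record (hypothetical sample data).
rows = [(1, 9.5, "x"), (2, 3.0, "y"), (3, 7.25, "z")]


def to_columns(rows):
    """Convert row-major tuples into column-major storage, one array per attribute."""
    return {
        "id": array("q", (r[0] for r in rows)),     # packed 64-bit ints
        "price": array("d", (r[1] for r in rows)),  # packed doubles
        "tag": [r[2] for r in rows],                # strings kept in a plain list
    }


def sum_price_rows(rows):
    # Touches every tuple even though only one field is needed.
    return sum(r[1] for r in rows)


def sum_price_columns(cols):
    # Scans only the contiguous "price" column.
    return sum(cols["price"])
```

Both functions return the same answer; only the memory access pattern differs.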

History

The European Systems Resurgence: MMDB

Back in the 1970s, there was some notable European systems impact, mostly in Germany, on things like B-trees and transactions. But in the 1980s-90s, European DB researchers were mostly doing theory, with a few exceptions. Systems work centered on US academics and companies.

In the 2000s, the Netherlands’ CWI (led by Martin Kersten and later Peter Boncz) and Germany’s T.U. Munich (led by Thomas Neumann and later also Viktor Leis) emerged as new centers of excellence in main-memory DBMS. There were also efforts at Wisconsin (Jignesh Patel’s Quickstep), MIT (Stonebraker/Madden et al., H-Store/Silo) and CMU (Andy Pavlo’s Peloton and NoisePage), but even Andy Pavlo gives most credit to the Europeans.

Germany’s SAP HANA also made a huge bet on in-memory around 2010, and the Hasso Plattner Institute backed that with a bunch of papers.

In terms of direct industrial impact, biggest impact has been on embedded/in-memory:

Also, H-Store became VoltDB, and the Quickstep startup was acquired by Pivotal for HAWQ/Greenplum. These infrastructure MMDB plays have been less visible than DuckDB and Hyper/Tableau.

Lots of commercial activity outside this space (e.g. MemSQL/SingleStore, ClickHouse).

Big Picture: Everything You Always Wanted to Know

Theme: 20th-century designs were I/O-throttled and ignored in-memory issues.

  - Need to optimize memory bandwidth (see X100)
  - Need to optimize register locality (see HyPer)
  - Cache locality is addressed in lots of papers, but it turns out it mostly falls out from the above

MonetDB X100 and Vectorwise
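A rough sketch of the vector-at-a-time idea: instead of one `next()` call per tuple, each primitive runs a tight loop over a cache-sized chunk of a column and passes a selection vector downstream. The function names, the `VECTOR_SIZE` value, and the query itself are assumptions for illustration, not X100's actual primitives.

```python
from array import array

VECTOR_SIZE = 1024  # assumed chunk size, picked so a vector fits in cache


def select_gt(col, lo, hi, threshold, sel):
    """Primitive: record positions i in [lo, hi) with col[i] > threshold in sel."""
    n = 0
    for i in range(lo, hi):
        if col[i] > threshold:
            sel[n] = i
            n += 1
    return n  # number of qualifying positions written to sel


def sum_selected(col, sel, n):
    """Primitive: sum col at the first n positions listed in the selection vector."""
    total = 0
    for j in range(n):
        total += col[sel[j]]
    return total


def query(keys, vals, threshold):
    """SELECT sum(vals) WHERE keys > threshold, processed one vector at a time."""
    sel = array("q", [0]) * VECTOR_SIZE  # reused selection vector
    total = 0
    for lo in range(0, len(keys), VECTOR_SIZE):
        hi = min(lo + VECTOR_SIZE, len(keys))
        n = select_gt(keys, lo, hi, threshold, sel)
        total += sum_selected(vals, sel, n)
    return total
```

In a real engine the primitives would be precompiled, type-specialized loops, so interpretation overhead is paid once per vector rather than once per tuple. Python only shows the structure here, not the performance.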

HyPer

Surely a compiler can help us?! Compiled queries are a long tradition in DBMSs, going back to System R!

Where X100 focuses on memory bandwidth, HyPer focuses on register locality!

High-level ideas:

  1. Query algebra is for code structuring, but compiler should break that abstraction to keep data in cache
  2. “Push” beats “Pull” for register locality (??)
  3. In 2011, LLVM is good enough to do what we need, let’s use its IR (yuck)!
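A sketch of the "push" idea from point 2, with hypothetical helper names (`scan_produce`, `make_filter`, `make_sum`, `fused_sum`). The producer drives the loop and pushes each tuple to its consumer. Once operator boundaries are compiled away, the pipeline collapses into one loop like `fused_sum`, where the current tuple can stay in registers between operators. This is hand-written Python, not code HyPer generates.

```python
def scan_produce(table, consume):
    """Producer owns the loop and pushes every row to its consumer."""
    for row in table:
        consume(row)


def make_filter(pred, consume):
    """Filter forwards a row only if the predicate holds."""
    def on_row(row):
        if pred(row):
            consume(row)
    return on_row


def make_sum(col):
    """Pipeline breaker: accumulates one column into mutable state."""
    state = {"total": 0}

    def on_row(row):
        state["total"] += row[col]
    return on_row, state


def fused_sum(table):
    """What the pipeline scan -> filter(key > 3) -> sum(val) looks like
    after fusion: one loop, no per-operator calls."""
    total = 0
    for key, val in table:
        if key > 3:
            total += val
    return total
```

Usage: build the pipeline from the sink backwards, then let the scan drive it.

```python
table = [(1, 10), (5, 20), (9, 30)]
sink, state = make_sum(1)
scan_produce(table, make_filter(lambda r: r[0] > 3, sink))
# state["total"] now matches fused_sum(table)
```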

Observations:

Recent Work: InkFuse

Hot off the press: ICDE 2024, including TUM and CWI folks!

Goals: best of both

  - vectorized performance with zero compilation time for short queries
  - benefit from compiled code for long queries

You might ask…specialized HW?